Predicting Vaccine Interest in the U.S.

Dataset is located at: https://www.kaggle.com/GoogleNewsLab/health-searches-us-county

  • Make sure to download and unzip the data

Vaccinations have saved millions of lives globally, most famously by eradicating smallpox in the late 1970s. Google search interest in vaccinations increased from 2008 through 2017, and while vaccination coverage across the U.S. has broadened, many children still go unvaccinated (https://stacks.cdc.gov/view/cdc/59413).

There are many reasons to look up vaccines on Google: preparing for travel abroad, seeking information about specific vaccinations for children or pets, research, and so on. Whatever the motivation, there is value in understanding how interest in vaccines is changing in the U.S. over time and how it varies by state.

While Google can surface useful resources and general information, the results are not always credible and often present conflicting information across websites. Education about vaccines should come from medical professionals.

This analysis could be turned into a tool built on Google's search API that provides real-time information to medical providers or organizations offering professional advice and education about vaccination. Providers could identify which U.S. states show the highest interest in vaccination and use that to decide where resources should be allocated for maximum effect. Where interest is unusually high or low, they could survey patients and community members to understand why, and use the results to decide which locations need vaccine education the most.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import requests
import zipfile
import io
from collections import Counter
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error
from plotly.offline import iplot
import plotly.express as px 
import plotly.graph_objects as go
import warnings
from numpy.linalg import LinAlgError
import time
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()  # register datetime converters explicitly (avoids a pandas FutureWarning when plotting)
In [2]:
# install watermark extension
!pip install watermark

# once it is installed, you'll just need this in future notebooks:
%load_ext watermark

%watermark -a "Chantel Clark" -d -t -v -p numpy,pandas,matplotlib,seaborn
Requirement already satisfied: watermark in c:\users\chant\anaconda3\lib\site-packages (2.0.2)
Requirement already satisfied: ipython in c:\users\chant\anaconda3\lib\site-packages (from watermark) (7.8.0)
Requirement already satisfied: setuptools>=18.5 in c:\users\chant\anaconda3\lib\site-packages (from ipython->watermark) (41.4.0)
Requirement already satisfied: pygments in c:\users\chant\anaconda3\lib\site-packages (from ipython->watermark) (2.4.2)
Requirement already satisfied: decorator in c:\users\chant\anaconda3\lib\site-packages (from ipython->watermark) (4.4.0)
Requirement already satisfied: jedi>=0.10 in c:\users\chant\anaconda3\lib\site-packages (from ipython->watermark) (0.15.1)
Requirement already satisfied: traitlets>=4.2 in c:\users\chant\anaconda3\lib\site-packages (from ipython->watermark) (4.3.3)
Requirement already satisfied: backcall in c:\users\chant\anaconda3\lib\site-packages (from ipython->watermark) (0.1.0)
Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in c:\users\chant\anaconda3\lib\site-packages (from ipython->watermark) (2.0.10)
Requirement already satisfied: colorama; sys_platform == "win32" in c:\users\chant\anaconda3\lib\site-packages (from ipython->watermark) (0.4.1)
Requirement already satisfied: pickleshare in c:\users\chant\anaconda3\lib\site-packages (from ipython->watermark) (0.7.5)
Requirement already satisfied: parso>=0.5.0 in c:\users\chant\anaconda3\lib\site-packages (from jedi>=0.10->ipython->watermark) (0.5.1)
Requirement already satisfied: six in c:\users\chant\anaconda3\lib\site-packages (from traitlets>=4.2->ipython->watermark) (1.12.0)
Requirement already satisfied: ipython-genutils in c:\users\chant\anaconda3\lib\site-packages (from traitlets>=4.2->ipython->watermark) (0.2.0)
Requirement already satisfied: wcwidth in c:\users\chant\anaconda3\lib\site-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython->watermark) (0.1.7)
Chantel Clark 2020-04-01 07:15:41 

CPython 3.7.4
IPython 7.8.0

numpy 1.16.5
pandas 0.25.1
matplotlib 3.1.1
seaborn 0.9.0

If running in Google Colab, uncommenting this cell will open a dialog where you can select and upload the CSV file you downloaded.

In [3]:
"""
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
"""
Out[3]:
'\nfrom google.colab import files\n\nuploaded = files.upload()\n\nfor fn in uploaded.keys():\n  print(\'User uploaded file "{name}" with length {length} bytes\'.format(\n      name=fn, length=len(uploaded[fn])))\n'
In [4]:
# file = 'RegionalInterestByConditionOverTime.csv'
file = r'C:\Users\chant\OneDrive\Documents\Springboard\Data\RegionalInterestByConditionOverTime.csv'
healthSearchData = pd.read_csv(file)
healthSearchData.head()
Out[4]:
dma geoCode 2004+cancer 2004+cardiovascular 2004+stroke 2004+depression 2004+rehab 2004+vaccine 2004+diarrhea 2004+obesity ... 2016+diabetes 2017+cancer 2017+cardiovascular 2017+stroke 2017+depression 2017+rehab 2017+vaccine 2017+diarrhea 2017+obesity 2017+diabetes
0 Portland-Auburn ME 500 44 6 17 39 21 31 14 29 ... 81 70 37 83 64 56 76 66 47 80
1 New York NY 501 47 6 13 38 16 33 12 27 ... 77 70 34 53 56 53 79 56 52 78
2 Binghamton NY 502 48 3 16 50 12 37 24 31 ... 74 68 24 71 69 44 77 78 61 72
3 Macon GA 503 44 14 14 37 19 49 14 29 ... 78 53 38 62 46 60 47 53 41 66
4 Philadelphia PA 504 52 7 16 41 23 36 14 30 ... 80 75 35 61 62 75 84 69 56 78

5 rows × 128 columns

In [5]:
# Check for null values (no printed output means none were found)
for col in healthSearchData.columns:
  bool_series = pd.isnull(healthSearchData[col])
  if len(healthSearchData[bool_series].index.values) != 0:
    print(healthSearchData[bool_series].index.values)
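
An equivalent vectorized check (a sketch, not part of the original run) uses pandas' built-in aggregation:

null_counts = healthSearchData.isnull().sum()  # count of nulls per column
print(null_counts[null_counts > 0])            # empty output means no missing values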
In [6]:
# Create a 'state_abbrev' column by extracting state, then map state to U.S. region
temp= [] # initialize empty list

for i in healthSearchData['dma']:
  temp.append(i[-2:]) # add last two letters of dma to temp list

healthSearchData['state_abbrev'] = temp # create new column 'state_abbrev'
In [7]:
# Fix index 11 state_abbrev; this dma is formatted differently from the others
healthSearchData.at[11, 'state_abbrev'] = 'DC'  # .at[] replaces the deprecated set_value
healthSearchData.iloc[11]
Out[7]:
dma                    Washington DC (Hagerstown MD)
geoCode                                          511
2004+cancer                                       47
2004+cardiovascular                                5
2004+stroke                                       15
                                   ...              
2017+vaccine                                      79
2017+diarrhea                                     55
2017+obesity                                      59
2017+diabetes                                     73
state_abbrev                                      DC
Name: 11, Length: 129, dtype: object
In [8]:
# Dictionary of state abbreviations to region ('West', 'Midwest', 'Northeast', 'South', and 'O' = other)

states_region = {
        'AK': 'West',
        'AL': 'South',
        'AR': 'South',
        'AS': 'O',
        'AZ': 'West',
        'CA': 'West',
        'CO': 'West',
        'CT': 'Northeast',
        'DC': 'South',
        'DE': 'Northeast',
        'FL': 'South',
        'GA': 'South',
        'GU': 'O',
        'HI': 'West',
        'IA': 'Midwest',
        'ID': 'West',
        'IL': 'Midwest',
        'IN': 'Midwest',
        'KS': 'Midwest',
        'KY': 'South',
        'LA': 'South',
        'MA': 'Northeast',
        'MD': 'South',
        'ME': 'Northeast',
        'MI': 'Midwest',
        'MN': 'Midwest',
        'MO': 'Midwest',
        'MP': 'O',
        'MS': 'South',
        'MT': 'West',
        'NA': 'O',
        'NC': 'South',
        'ND': 'Midwest',
        'NE': 'Midwest',
        'NH': 'Northeast',
        'NJ': 'Northeast',
        'NM': 'West',
        'NV': 'West',
        'NY': 'Northeast',
        'OH': 'Midwest',
        'OK': 'South',
        'OR': 'West',
        'PA': 'Northeast',
        'PR': 'O',
        'RI': 'Northeast',
        'SC': 'South',
        'SD': 'Midwest',
        'TN': 'South',
        'TX': 'South',
        'UT': 'West',
        'VA': 'South',
        'VI': 'O',
        'VT': 'Northeast',
        'WA': 'West',
        'WI': 'Midwest',
        'WV': 'South',
        'WY': 'West'
}
In [9]:
# Create a column 'region' that maps state abbreviation to region
healthSearchData['region'] = healthSearchData['state_abbrev'].map(states_region)
In [10]:
healthSearchData.head()
Out[10]:
dma geoCode 2004+cancer 2004+cardiovascular 2004+stroke 2004+depression 2004+rehab 2004+vaccine 2004+diarrhea 2004+obesity ... 2017+cardiovascular 2017+stroke 2017+depression 2017+rehab 2017+vaccine 2017+diarrhea 2017+obesity 2017+diabetes state_abbrev region
0 Portland-Auburn ME 500 44 6 17 39 21 31 14 29 ... 37 83 64 56 76 66 47 80 ME Northeast
1 New York NY 501 47 6 13 38 16 33 12 27 ... 34 53 56 53 79 56 52 78 NY Northeast
2 Binghamton NY 502 48 3 16 50 12 37 24 31 ... 24 71 69 44 77 78 61 72 NY Northeast
3 Macon GA 503 44 14 14 37 19 49 14 29 ... 38 62 46 60 47 53 41 66 GA South
4 Philadelphia PA 504 52 7 16 41 23 36 14 30 ... 35 61 62 75 84 69 56 78 PA Northeast

5 rows × 130 columns

In [11]:
# Drop the 'geoCode' column
healthSearchData = healthSearchData.drop('geoCode', axis=1)
In [12]:
healthSearchData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Columns: 129 entries, dma to region
dtypes: int64(126), object(3)
memory usage: 211.8+ KB
In [13]:
healthSearchData.describe().T.head()
Out[13]:
count mean std min 25% 50% 75% max
2004+cancer 210.0 43.904762 7.618944 27.0 40.0 43.0 47.0 100.0
2004+cardiovascular 210.0 7.433333 7.909647 0.0 5.0 6.0 9.0 100.0
2004+stroke 210.0 17.642857 8.135284 0.0 14.0 16.0 18.0 100.0
2004+depression 210.0 45.623810 13.715720 0.0 37.0 44.0 51.0 100.0
2004+rehab 210.0 18.890476 10.157723 0.0 15.0 17.0 21.0 100.0

The data come from 210 metropolitan areas (designated market areas, the 'dma' column) across the U.S. Scores range from 0 to 100 and are described by the Google Search API as 'interest scores'. They are proportional to the fraction of all Google searches in an area, so a score of 100 represents the highest relative interest and a score of 50 represents half that search share. A score of 0 indicates that there was not enough data.

Caution must be used in interpreting this data: because the scores are proportions, the total count of searches cannot be recovered. A larger city where half of all queries related to vaccines would receive a score of 50, while a smaller city where $\frac{3}{4}$ of queries related to vaccines would receive a score of 75, even though the larger city may have far more vaccine searches in absolute terms.
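
As a small numeric sketch of this caveat (the counts below are invented for illustration, not taken from the dataset):

big_city   = {'vaccine': 500_000, 'total': 1_000_000}  # 1/2 of all queries
small_city = {'vaccine': 7_500,   'total': 10_000}     # 3/4 of all queries

for name, c in [('big city', big_city), ('small city', small_city)]:
    print(name, round(100 * c['vaccine'] / c['total']))  # 50 vs. 75, despite far fewer searches in the small city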

Because the data includes proportional health interest scores and not total counts, the best way to use this data would be to look at relative increases and decreases in the interest scores. For example:

  • Are there any patterns in the regions which contain high outliers?
  • How is interest in vaccines changing over time and space?
  • Are there differences in interest score between the four U.S. regions (Northeast, South, West, Midwest)?

This way, even if recent years saw a larger volume of searches because of increased internet use, the mean score should not be greatly affected by the larger user base. Realistically, though, broader internet access among different kinds of users (of various ages and income levels) will influence the outcomes, and this should be considered when interpreting the data.

Boxplot and Swarmplot

The boxplot is useful for identifying outliers, but the swarmplots carry more information: plotting all of the data as a swarmplot shows which U.S. region each outlier comes from.

In [14]:
# Create 'vaccine' search dataframe with columns for year, count, region.
years = []
vacc_all = pd.DataFrame()
for col in healthSearchData.columns:
    if '+' in col and 'vaccine' in col:  # columns look like '2004+vaccine'
        year = col.split('+')[0]
        years.append(year)
        add_col = healthSearchData[col]
        vacc_all = pd.concat([vacc_all, add_col], axis=1)
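
The same columns can be selected in one line with DataFrame.filter (a sketch assuming every vaccine column follows the 'YYYY+vaccine' naming pattern):

vacc_alt = healthSearchData.filter(regex=r'^\d{4}\+vaccine$')  # '2004+vaccine' ... '2017+vaccine'
years_alt = [c.split('+')[0] for c in vacc_alt.columns]        # ['2004', ..., '2017']
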
In [15]:
vacc_all.describe().T
Out[15]:
count mean std min 25% 50% 75% max
2004+vaccine 210.0 32.476190 14.768773 0.0 25.0 31.0 38.00 100.0
2005+vaccine 210.0 29.342857 12.005843 0.0 23.0 27.5 33.00 100.0
2006+vaccine 210.0 38.152381 13.964604 0.0 31.0 37.0 44.00 100.0
2007+vaccine 210.0 40.928571 11.263255 0.0 35.0 40.0 46.00 100.0
2008+vaccine 210.0 24.104762 8.352204 0.0 20.0 23.0 26.00 100.0
2009+vaccine 210.0 35.447619 8.546502 0.0 31.0 35.0 40.00 100.0
2010+vaccine 210.0 45.404762 10.706300 0.0 41.0 45.0 50.00 100.0
2011+vaccine 210.0 57.947619 10.871239 0.0 51.0 57.0 64.00 100.0
2012+vaccine 210.0 56.985714 10.130006 0.0 51.0 57.0 62.00 100.0
2013+vaccine 210.0 62.628571 10.163503 0.0 57.0 63.0 68.00 100.0
2014+vaccine 210.0 61.690476 11.051615 0.0 56.0 63.0 68.00 100.0
2015+vaccine 210.0 62.385714 12.239194 0.0 56.0 63.0 69.00 100.0
2016+vaccine 210.0 62.985714 9.974359 0.0 57.0 63.0 68.75 100.0
2017+vaccine 210.0 72.800000 10.983154 0.0 67.0 73.0 79.00 100.0
In [16]:
vacc_all.head()
Out[16]:
2004+vaccine 2005+vaccine 2006+vaccine 2007+vaccine 2008+vaccine 2009+vaccine 2010+vaccine 2011+vaccine 2012+vaccine 2013+vaccine 2014+vaccine 2015+vaccine 2016+vaccine 2017+vaccine
0 31 36 48 42 23 40 45 64 68 70 72 64 64 76
1 33 29 38 41 23 36 42 51 47 60 57 58 64 79
2 37 21 35 45 29 43 46 48 56 66 66 73 72 77
3 49 33 50 42 24 24 41 54 57 60 53 53 56 47
4 36 33 42 47 27 40 49 67 62 69 68 69 70 84
In [17]:
# Rename columns with years only
vacc_all.columns = years
# Add 'region' column
vacc_all['region'] = healthSearchData['region']

# Melt dataframe to have only columns 'region', 'year', and 'value'
melted_df = pd.melt(vacc_all, id_vars='region', var_name='year' )
In [18]:
# Create a box plot for vaccine query
sns.set(font_scale=1.5)
sns.set_style('whitegrid')

fig = plt.gcf()
fig.set_size_inches(15, 8)
sns.boxplot(x='year', y='value', data=melted_df)
plt.title('Vaccine Search Interest Scores')
plt.xlabel('Year')
plt.ylabel('Interest Score')
Out[18]:
Text(0, 0.5, 'Interest Score')
In [19]:
# Swarm plot
fig = plt.gcf()
fig.set_size_inches(15, 8)
ax = fig.add_subplot(111)

s = sns.swarmplot(x = 'year', y='value', hue='region', data=melted_df, dodge=True)
plt.title('Vaccine Search Interest Scores')
plt.xlabel('Year')
plt.ylabel('Interest Score')
# Place legend outside of plot
handles, labels = ax.get_legend_handles_labels()
l = plt.legend(handles[0:4], labels[0:4], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

# Count the number of zeros and total number of data points
num_zeros = len(melted_df[melted_df.value == 0])
total_n = len(melted_df)
# Add counts text to plot
s.text(10+0.2, 6.5, "zeros: " + str(num_zeros), horizontalalignment='left', size='medium', color='black', weight='semibold')
s.text(10+0.2, 2.5, "total n: " + str(total_n), horizontalalignment='left', size='medium', color='black', weight='semibold')
Out[19]:
Text(10.2, 2.5, 'total n: 2940')

Scores of 100 come from all regions, with the fewest from the Northeast, which produced only a single 100 score (in 2004).

There were 32 zero scores in total, about 1.1% of the vaccine subset. Since zeros indicate insufficient data rather than zero interest, and they make up only this small percentage, it is reasonable to omit them.
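
The 1.1% figure follows directly from the counts computed for the plot above:

print(round(100 * num_zeros / total_n, 2))  # 32 / 2940 ≈ 1.09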

In [20]:
# vacc_all dataframe, each row is a city
# Add 'state' column
vacc_all['state'] = healthSearchData['state_abbrev']
vacc_all.head()
Out[20]:
2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 region state
0 31 36 48 42 23 40 45 64 68 70 72 64 64 76 Northeast ME
1 33 29 38 41 23 36 42 51 47 60 57 58 64 79 Northeast NY
2 37 21 35 45 29 43 46 48 56 66 66 73 72 77 Northeast NY
3 49 33 50 42 24 24 41 54 57 60 53 53 56 47 South GA
4 36 33 42 47 27 40 49 67 62 69 68 69 70 84 Northeast PA
In [21]:
# Count the number of entries/cities per state
#from collections import Counter

Counter(vacc_all.state)
Out[21]:
Counter({'ME': 3,
         'NY': 10,
         'GA': 7,
         'PA': 6,
         'MI': 7,
         'NH': 1,
         'IN': 6,
         'OH': 9,
         'DC': 1,
         'MD': 2,
         'NC': 5,
         'SC': 4,
         'MA': 2,
         'FL': 9,
         'KY': 3,
         'VA': 6,
         'CT': 1,
         'TN': 5,
         'WV': 4,
         'TX': 17,
         'IL': 6,
         'KS': 3,
         'MO': 6,
         'AL': 4,
         'MN': 3,
         'LA': 6,
         'WI': 6,
         'IA': 4,
         'OK': 4,
         'AR': 4,
         'MS': 6,
         'NE': 4,
         'ND': 2,
         'SD': 2,
         'AK': 3,
         'HI': 1,
         'CO': 3,
         'AZ': 2,
         'MT': 6,
         'ID': 3,
         'WY': 1,
         'UT': 1,
         'CA': 12,
         'NM': 1,
         'OR': 4,
         'WA': 3,
         'NV': 2})
In [22]:
# Melt dataframe to have only columns 'region', 'state', 'year', and 'score'
melted_df = pd.melt(vacc_all, id_vars=['region','state'], var_name='year', value_name='score')
melted_df.head()
Out[22]:
region state year score
0 Northeast ME 2004 31
1 Northeast NY 2004 33
2 Northeast NY 2004 37
3 South GA 2004 49
4 Northeast PA 2004 36
In [23]:
# Replace zero scores (insufficient data) with NaN before calculating state mean scores
melted_df = melted_df.replace(0, np.nan)
In [24]:
# Find all of the locations with scores of 100
melted_df[melted_df.score == 100]
Out[24]:
region state year score
26 Northeast NY 2004 100.0
121 South MS 2004 100.0
362 South MS 2005 100.0
540 South LA 2006 100.0
751 South MS 2007 100.0
917 Midwest MI 2008 100.0
1127 Midwest MI 2009 100.0
1342 South WV 2010 100.0
1562 Midwest MN 2011 100.0
1772 Midwest MN 2012 100.0
1982 Midwest MN 2013 100.0
2180 South FL 2014 100.0
2477 West AK 2015 100.0
2612 Midwest MN 2016 100.0
2810 South FL 2017 100.0
In [25]:
# Find the mean score for each state, in each year
grouped_df = melted_df.groupby(['year', 'state']).mean()
grouped_df.head()
Out[25]:
score
year state
2004 AK 32.000000
AL 38.000000
AR 31.500000
AZ 30.500000
CA 35.833333
In [26]:
# Reset index to make year and state columns
grouped_df.reset_index(inplace=True)
grouped_df.head()
Out[26]:
year state score
0 2004 AK 32.000000
1 2004 AL 38.000000
2 2004 AR 31.500000
3 2004 AZ 30.500000
4 2004 CA 35.833333

Interactive Map With Slider

In [27]:
grouped_df.head()
Out[27]:
year state score
0 2004 AK 32.000000
1 2004 AL 38.000000
2 2004 AR 31.500000
3 2004 AZ 30.500000
4 2004 CA 35.833333
In [28]:
""""
def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))
configure_plotly_browser_state()
"""
Out[28]:
'"\ndef configure_plotly_browser_state():\n  import IPython\n  display(IPython.core.display.HTML(\'\'\'\n        <script src="/static/components/requirejs/require.js"></script>\n        <script>\n          requirejs.config({\n            paths: {\n              base: \'/static/base\',\n              plotly: \'https://cdn.plot.ly/plotly-latest.min.js?noext\',\n            },\n          });\n        </script>\n        \'\'\'))\nconfigure_plotly_browser_state()\n'
In [29]:
# Create slider
data_slider = []

for year in grouped_df.year.unique():
    # Select the year; copy so adding columns below doesn't trigger SettingWithCopyWarning
    one_year = grouped_df[grouped_df.year == year].copy()

    for col in grouped_df.columns:  # Transform the columns into string type
        one_year[col] = one_year[col].astype(str)

    ### Create the text for mouse-hover for each state, for the current year    
    one_year['text'] = one_year['state'] + '<br>Score: ' + one_year['score']

    ### Create the dictionary with the data for the current year
    data_one_year = dict(
                        type='choropleth',
                        locations = one_year['state'],
                        z=one_year['score'].astype(float),
                        locationmode='USA-states',
                        zmin=20,
                        zmax=100,
                        text = one_year['text'],
                        hoverinfo='text',
                        )

    data_slider.append(data_one_year)  # Add the dictionary to the list of dictionaries for the slider
In [30]:
# Create the steps for the slider

steps = []

for i in range(len(data_slider)):
    step = dict(method='restyle',
                args=['visible', [False] * len(data_slider)],
                label='Year {}'.format(i + 2004)) # label to be displayed for each step (year)
    step['args'][1][i] = True
    steps.append(step)
In [31]:
# Create the 'sliders' object from the 'steps' 
sliders = [dict(active=0, pad={"t": 1}, steps=steps)] 
In [32]:
#from plotly.offline import iplot
#configure_plotly_browser_state() 

# Set up the layout (including the slider option)
layout = dict(geo=dict(scope='usa',
                       projection={'type': 'albers usa'},
                       showlakes=False),
              sliders=sliders)

# Create the figure object:
fig = dict(data=data_slider, layout=layout)

# plot 
iplot(fig)
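
An equivalent construction with plotly.graph_objects (already imported above) wraps the same trace and layout dictionaries in a Figure object; a sketch:

fig2 = go.Figure(data=data_slider, layout=layout)
fig2.show()  # renders the same choropleth with slider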

Interest across the states dropped during 2008, followed by a large increase from 2009-2011. Although the Northeast produced the fewest scores of 100, it appears to have consistently higher interest in vaccines than the South.

Correlation Heat Maps

In [33]:
state_df = grouped_df.pivot(index='year', columns='state', values='score')
state_df.head()
Out[33]:
state AK AL AR AZ CA CO CT DC FL GA ... SC SD TN TX UT VA WA WI WV WY
year
2004 32.000000 38.00 31.50 30.5 35.833333 25.333333 43.0 46.0 28.666667 35.142857 ... 24.75 46.5 21.8 31.933333 21.0 33.166667 31.000000 33.833333 41.750000 27.0
2005 33.000000 27.00 27.75 29.5 27.250000 26.000000 35.0 42.0 21.666667 28.714286 ... 27.00 34.0 25.6 34.176471 22.0 26.666667 27.000000 29.166667 21.333333 45.0
2006 43.000000 32.50 39.25 35.5 35.083333 32.666667 46.0 55.0 31.111111 36.285714 ... 31.75 57.0 36.0 33.266667 31.0 39.833333 34.000000 44.833333 54.000000 23.0
2007 44.333333 35.75 30.00 35.5 33.083333 35.333333 47.0 52.0 38.111111 39.428571 ... 40.25 43.5 40.0 39.941176 32.0 45.000000 42.000000 45.333333 41.250000 55.0
2008 24.333333 20.75 19.00 23.0 20.750000 24.333333 26.0 21.0 20.333333 20.428571 ... 21.50 24.0 21.0 22.176471 20.0 24.500000 24.333333 25.166667 37.250000 18.0

5 rows × 47 columns

In [34]:
# Is there a correlation between states?
# Create a correlation / heat map

sns.set(font_scale=1.25)
# Create a correlation matrix of means
corr = state_df.corr(method='pearson')

# Seaborn heatmap
plt.figure(figsize=(17,17))
ax = sns.heatmap(
    corr, 
    vmin=0.6, vmax=1, center=0.8,
    cmap='coolwarm',
    square=True,
    robust=True,
    annot_kws={"size":10},
    cbar_kws={"shrink": 0.5}
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=90
);

plt.title('Pearson Correlation Matrix of Mean Scores', fontdict = {'fontsize' : 20})
Out[34]:
Text(0.5, 1, 'Pearson Correlation Matrix of Mean Scores')
In [35]:
# What is going on with Mississippi? Are the scores from MS really different from those of other states?

MS = grouped_df[grouped_df.state == 'MS']
AL = grouped_df[grouped_df.state == 'AL']
CO = grouped_df[grouped_df.state == 'CO']

# Plot MS, AL, and CO mean scores over time
sns.set(font_scale=1.25)
sns.set_style('whitegrid')
plt.figure(figsize=(12,5))
plt.plot(MS.year, MS.score)
plt.plot(AL.year, AL.score)
plt.plot(CO.year, CO.score)
plt.ylim(0,100)
plt.title('Mississippi vs. Alabama and Colorado Vaccine Interest')
plt.xlabel('Year')
plt.ylabel('Average Score')
plt.legend(loc='best', labels=['MS', 'AL','CO'])
Out[35]:
<matplotlib.legend.Legend at 0x1c9be58c448>
In [36]:
vacc_all.head()
Out[36]:
2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 region state
0 31 36 48 42 23 40 45 64 68 70 72 64 64 76 Northeast ME
1 33 29 38 41 23 36 42 51 47 60 57 58 64 79 Northeast NY
2 37 21 35 45 29 43 46 48 56 66 66 73 72 77 Northeast NY
3 49 33 50 42 24 24 41 54 57 60 53 53 56 47 South GA
4 36 33 42 47 27 40 49 67 62 69 68 69 70 84 Northeast PA
In [37]:
# Is there a correlation between years?
# Create a correlation / heat map

sns.set(font_scale=1.25)
# Correlation matrix of yearly city-level scores (non-numeric columns are ignored)
corr = vacc_all.corr(method='pearson')

# Seaborn heatmap
plt.figure(figsize=(10,10))
ax = sns.heatmap(
    corr, 
    vmin=0, vmax=1, center=0.5,
    cmap='coolwarm',
    square=True,
    robust=True,
    annot=True,
    annot_kws={"size":10},
    cbar_kws={"shrink": 0.5}
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=60,
    horizontalalignment='right'
);

plt.title('Pearson Correlation Matrix of Yearly Scores', fontdict = {'fontsize' : 20})
Out[37]:
Text(0.5, 1, 'Pearson Correlation Matrix of Yearly Scores')

Yearly scores show low correlation between 2004 and 2010, with increasingly strong correlation across the years 2011-2017.

Machine Learning / Predictive Analysis

In [38]:
grouped_df.head()
Out[38]:
year state score
0 2004 AK 32.000000
1 2004 AL 38.000000
2 2004 AR 31.500000
3 2004 AZ 30.500000
4 2004 CA 35.833333
In [39]:
# Subset California data
ca_df = grouped_df[grouped_df.state == 'CA']
ca_df = ca_df.drop('state', axis=1)
ca_df.head()
Out[39]:
year score
4 2004 35.833333
51 2005 27.250000
98 2006 35.083333
145 2007 33.083333
192 2008 20.750000
In [40]:
sns.set(font_scale=1.25)
sns.set_style("whitegrid")

plt.figure(figsize=(12,5))
ax = plt.subplot(111)

plt.plot(ca_df.year, ca_df.score)
plt.ylim(0,100)
plt.title("California Vaccine Search Interest", fontsize=20)
plt.xlabel('Year')
plt.ylabel('Mean Score')
Out[40]:
Text(0, 0.5, 'Mean Score')
In [41]:
# Use datetime formatting
ca_df['year'] = pd.to_datetime(ca_df['year'], format="%Y")
# Rename column 'year' to 'date'
ca_df.columns = ['date', 'score']
ca_df.head()
Out[41]:
date score
4 2004-01-01 35.833333
51 2005-01-01 27.250000
98 2006-01-01 35.083333
145 2007-01-01 33.083333
192 2008-01-01 20.750000
In [42]:
ca_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 4 to 615
Data columns (total 2 columns):
date     14 non-null datetime64[ns]
score    14 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 336.0 bytes
In [43]:
# Set the date column as the index
ca_df.set_index('date', inplace=True)
ca_df.head()
Out[43]:
score
date
2004-01-01 35.833333
2005-01-01 27.250000
2006-01-01 35.083333
2007-01-01 33.083333
2008-01-01 20.750000
In [44]:
# Upsample to monthly frequency and interpolate linearly between the annual means
interp_df = ca_df.resample('M').mean().interpolate(method='linear')
interp_df.head(13)
Out[44]:
score
date
2004-01-31 35.833333
2004-02-29 35.118056
2004-03-31 34.402778
2004-04-30 33.687500
2004-05-31 32.972222
2004-06-30 32.256944
2004-07-31 31.541667
2004-08-31 30.826389
2004-09-30 30.111111
2004-10-31 29.395833
2004-11-30 28.680556
2004-12-31 27.965278
2005-01-31 27.250000
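
As a quick sanity check on the interpolation, each monthly step between the 2004 mean (35.8333) and the 2005 mean (27.25) should be their difference spread over 12 months:

step = (27.25 - 35.833333) / 12  # ≈ -0.7153 per month
print(35.833333 + step)          # ≈ 35.1181, matching the 2004-02-29 row above
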
In [45]:
# Calculate the rolling statistics with window using 3 observations
roll_mean = interp_df.rolling(window=3).mean()
roll_sd = interp_df.rolling(window=3).std()
#roll_mean.head()
In [46]:
interp_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 157 entries, 2004-01-31 to 2017-01-31
Freq: M
Data columns (total 1 columns):
score    157 non-null float64
dtypes: float64(1)
memory usage: 2.5 KB
In [47]:
# Plot the data with rolling stats
plt.figure(figsize=(15,6))

plt.plot(interp_df, label='Original Data')
plt.plot(roll_mean, color='red', label='Rolling Mean')
plt.plot(roll_sd, color='black', label='Rolling Standard Deviation')
plt.title("Rolling Mean and Standard Deviation")
plt.legend(loc='best', fontsize='small')
Out[47]:
<matplotlib.legend.Legend at 0x1c9beadfcc8>
In [48]:
# Dickey-Fuller test
# from statsmodels.tsa.stattools import adfuller

dftest = adfuller(interp_df['score'], autolag='AIC')

dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', 'n_lags', 'n_observations'])

for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value

print(dfoutput)
Test Statistic           -0.389136
p-value                   0.911897
n_lags                   13.000000
n_observations          143.000000
Critical Value (1%)      -3.476927
Critical Value (5%)      -2.881973
Critical Value (10%)     -2.577665
dtype: float64
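
With a p-value of 0.91 we cannot reject the unit-root null hypothesis, so the interpolated series is treated as non-stationary and differenced below; a sketch of that decision rule:

if dfoutput['p-value'] > 0.05:
    print('Cannot reject the unit root -> difference the series')
else:
    print('Series looks stationary')
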
In [49]:
def test_stationarity(timeseries):
    """ Plot rolling statistics and print Dickey-Fuller test results for a 'score' timeseries """
    # Calculate the rolling statistics with window using 3 observations
    roll_mean = timeseries.rolling(window=3).mean()
    roll_sd = timeseries.rolling(window=3).std()
    # Plot
    plt.figure(figsize=(15,6))
    plt.plot(timeseries, label='Data')
    plt.plot(roll_mean, color='red', label='Rolling Mean')
    plt.plot(roll_sd, color='black', label='Rolling Standard Deviation')
    plt.title("Rolling Mean and Standard Deviation")
    plt.legend(loc='best', fontsize='small')

    # Dickey-Fuller test
    dftest = adfuller(timeseries['score'], autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', 'n_lags', 'n_observations'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)
In [50]:
# Create 1st and 2nd order differences, plus a log-transformed 1st difference

diff = interp_df.diff()
diff.dropna(inplace=True)

diff2 = interp_df.diff().diff()
diff2.dropna(inplace=True)

log_df = np.log(interp_df)
log_diff = log_df - log_df.shift()
log_diff.dropna(inplace=True)
In [51]:
test_stationarity(diff)
Test Statistic           -2.402649
p-value                   0.140990
n_lags                   12.000000
n_observations          143.000000
Critical Value (1%)      -3.476927
Critical Value (5%)      -2.881973
Critical Value (10%)     -2.577665
dtype: float64
In [52]:
test_stationarity(diff2)
Test Statistic         -6.089455e+00
p-value                 1.044985e-07
n_lags                  1.100000e+01
n_observations          1.430000e+02
Critical Value (1%)    -3.476927e+00
Critical Value (5%)    -2.881973e+00
Critical Value (10%)   -2.577665e+00
dtype: float64
In [53]:
test_stationarity(log_diff)
Test Statistic           -3.199541
p-value                   0.020011
n_lags                    0.000000
n_observations          155.000000
Critical Value (1%)      -3.473259
Critical Value (5%)      -2.880374
Critical Value (10%)     -2.576812
dtype: float64

Tuning ARIMA parameters

In [54]:
# Plot Data, Residuals and ACF functions
#from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Original Series
fig, ax = plt.subplots(3, 2, figsize=(15,8))
plot = ax[0,0].plot(ca_df)
ax[0,0].set_title('Original Series')
acf=plot_acf(interp_df, ax=ax[0,1])

plot2 = ax[1,0].plot(diff)
ax[1,0].set_title('1st Difference')
acf2=plot_acf(diff, ax=ax[1,1])

plot3 = ax[2,0].plot(diff2)
ax[2,0].set_title('2nd Difference')
acf2=plot_acf(diff2, ax=ax[2,1])

plt.tight_layout()
In [55]:
# Original Series
fig, ax = plt.subplots(3, 2, figsize=(15,8))
plot = ax[0,0].plot(interp_df)
ax[0,0].set_title('Original Series')
pacf=plot_pacf(interp_df, lags=15, ax=ax[0,1])
#pacf = plot_pacf(ca_df, lags=15)

plot2 = ax[1,0].plot(diff)
ax[1,0].set_title('1st Difference')
pacf2=plot_pacf(diff, lags=15, ax=ax[1,1])

plot3 = ax[2,0].plot(diff2)
ax[2,0].set_title('2nd Difference')
pacf2=plot_pacf(diff2, lags=15, ax=ax[2,1])

plt.tight_layout()

Try using 1 lag for parameter p, since the first PACF lag lies outside the significance bounds.

In [56]:
# ARIMA (Autoregressive Integrated Moving Average) model is used for timeseries forecasting

#from statsmodels.tsa.arima_model import ARIMA
# Fit model: order=(p,d,q) -> p = autoregressive lag order, d = difference order, q = moving-average window size
model = ARIMA(interp_df, order=(1,1,5))
results_AR = model.fit(disp=-1)
print(results_AR.summary())
                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                D.score   No. Observations:                  156
Model:                 ARIMA(1, 1, 5)   Log Likelihood                 -39.335
Method:                       css-mle   S.D. of innovations              0.310
Date:                Sat, 28 Mar 2020   AIC                             94.671
Time:                        11:46:35   BIC                            119.069
Sample:                    02-29-2004   HQIC                           104.580
                         - 01-31-2017                                         
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.2188      0.213      1.028      0.305      -0.198       0.636
ar.L1.D.score     0.8522      0.069     12.379      0.000       0.717       0.987
ma.L1.D.score     0.0881      0.105      0.839      0.403      -0.118       0.294
ma.L2.D.score     0.0776      0.100      0.772      0.441      -0.119       0.274
ma.L3.D.score     0.0656      0.094      0.698      0.486      -0.119       0.250
ma.L4.D.score     0.0519      0.085      0.614      0.540      -0.114       0.218
ma.L5.D.score     0.0366      0.077      0.477      0.634      -0.114       0.187
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.1735           +0.0000j            1.1735            0.0000
MA.1            1.1877           -1.2817j            1.7474           -0.1311
MA.2            1.1877           +1.2817j            1.7474            0.1311
MA.3           -0.8338           -1.8742j            2.0513           -0.3166
MA.4           -0.8338           +1.8742j            2.0513            0.3166
MA.5           -2.1288           -0.0000j            2.1288           -0.5000
-----------------------------------------------------------------------------

The AIC for this fit is 94.671; we will compare it against other ARIMA orders below. A lower AIC indicates a better-fitting model, after penalizing for the number of estimated parameters.
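
As a check on the summary table, AIC = 2k - 2*ln(L); assuming statsmodels counts k = 8 estimated parameters here (the constant, one AR and five MA coefficients, plus the innovation variance):

k, log_lik = 8, -39.335     # assumed parameter count; log likelihood from the summary above
print(2 * k - 2 * log_lik)  # 94.67, matching the reported AIC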

In [57]:
# Loop through values for parameters p,d,q to determine which model produces the lowest AIC

# Print warnings only once
#import warnings
warnings.filterwarnings(action='once')

ps = [0,1,2,3,4,5]
ds=[0,1,2]
qs = [0,1,2,3,4,5]
best_aic = 100
best_p = 0
best_d = 0
best_q = 0
for p in ps:
  for d in ds:
    for q in qs:
      try: 
        model = ARIMA(interp_df, order=(p,d,q)) # (p,d,q) -> p = lag value for autoregression, d = difference order, q = size of moving average window
        results_AR = model.fit(disp=-1)
        aic = results_AR.aic
        if aic < best_aic:
          best_p = p
          best_d = d
          best_q = q
          best_aic = aic
          print(p,d, q, aic)
      except:
        pass

'The best p,d,q = {}, {}, {}, with AIC = {}'.format(best_p, best_d, best_q, best_aic)
C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\base\model.py:492: HessianInversionWarning:

Inverting hessian failed, no bse or cov_params available

0 1 5 94.51866244694133
0 2 0 89.27360855778495
1 1 0 86.30165771430467
C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\base\model.py:512: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals

2 1 1 85.15612468431254
C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:668: RuntimeWarning:

invalid value encountered in true_divide

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:695: RuntimeWarning:

invalid value encountered in log

3 1 2 84.59401199003241
C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:693: RuntimeWarning:

invalid value encountered in double_scalars

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:695: RuntimeWarning:

divide by zero encountered in true_divide

4 1 4 84.40573055043802
C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:695: RuntimeWarning:

divide by zero encountered in log

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tools\numdiff.py:243: RuntimeWarning:

invalid value encountered in subtract

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:225: RuntimeWarning:

invalid value encountered in log

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:225: RuntimeWarning:

invalid value encountered in true_divide

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tools\numdiff.py:243: RuntimeWarning:

invalid value encountered in multiply

5 1 3 84.31767259903393
5 1 4 81.36596154565893
Out[57]:
'The best p,d,q = 5, 1, 4, with AIC = 81.36596154565893'
In [58]:
print(best_p, best_d, best_q)
5 1 4

The best model has order (5,1,4) with an AIC of 81.366, noticeably lower than the first model with order (1,1,5). Below, both the (5,1,4) model and the runner-up (5,1,3) are fit to the California subset of scores for comparison.

In [59]:
# Fit the runner-up model, order (5,1,3)
model = ARIMA(interp_df, order=(5,1,3))
results_AR = model.fit(disp=-1)
print(results_AR.summary())
                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                D.score   No. Observations:                  156
Model:                 ARIMA(5, 1, 3)   Log Likelihood                 -32.159
Method:                       css-mle   S.D. of innovations              0.289
Date:                Sat, 28 Mar 2020   AIC                             84.318
Time:                        11:47:23   BIC                            114.816
Sample:                    02-29-2004   HQIC                            96.705
                         - 01-31-2017                                         
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.2698      0.042      6.422      0.000       0.187       0.352
ar.L1.D.score     0.2387      0.081      2.935      0.004       0.079       0.398
ar.L2.D.score     1.3100      0.084     15.658      0.000       1.146       1.474
ar.L3.D.score     0.2024      0.128      1.575      0.117      -0.049       0.454
ar.L4.D.score    -0.7226      0.065    -11.043      0.000      -0.851      -0.594
ar.L5.D.score    -0.0695      0.086     -0.812      0.418      -0.237       0.098
ma.L1.D.score     0.7471      0.049     15.156      0.000       0.651       0.844
ma.L2.D.score    -0.7472      0.060    -12.467      0.000      -0.865      -0.630
ma.L3.D.score    -0.9999      0.049    -20.491      0.000      -1.096      -0.904
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -0.9847           -0.5331j            1.1198           -0.4210
AR.2           -0.9847           +0.5331j            1.1198            0.4210
AR.3            1.0400           -0.1027j            1.0450           -0.0157
AR.4            1.0400           +0.1027j            1.0450            0.0157
AR.5          -10.5069           -0.0000j           10.5069           -0.5000
MA.1            1.0000           -0.0000j            1.0000           -0.0000
MA.2           -0.8737           -0.4867j            1.0001           -0.4191
MA.3           -0.8737           +0.4867j            1.0001            0.4191
-----------------------------------------------------------------------------
In [60]:
# Fit the best model from the grid search, order (5,1,4)
model = ARIMA(interp_df, order=(best_p, best_d, best_q))
results_AR = model.fit(disp=-1)
print(results_AR.summary())
                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                D.score   No. Observations:                  156
Model:                 ARIMA(5, 1, 4)   Log Likelihood                 -29.683
Method:                       css-mle   S.D. of innovations              0.283
Date:                Sat, 28 Mar 2020   AIC                             81.366
Time:                        11:47:25   BIC                            114.914
Sample:                    02-29-2004   HQIC                            94.992
                         - 01-31-2017                                         
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.2670      0.046      5.845      0.000       0.177       0.356
ar.L1.D.score     0.0151      0.078      0.194      0.846      -0.137       0.167
ar.L2.D.score     0.9388      0.071     13.209      0.000       0.800       1.078
ar.L3.D.score     0.7804      0.095      8.173      0.000       0.593       0.968
ar.L4.D.score    -0.1339      0.071     -1.895      0.060      -0.272       0.005
ar.L5.D.score    -0.6582      0.067     -9.839      0.000      -0.789      -0.527
ma.L1.D.score     1.0338      0.056     18.369      0.000       0.924       1.144
ma.L2.D.score    -0.0001      0.064     -0.002      0.999      -0.126       0.125
ma.L3.D.score    -1.0339      0.064    -16.226      0.000      -1.159      -0.909
ma.L4.D.score    -0.9998      0.057    -17.663      0.000      -1.111      -0.889
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -1.1162           -0.0000j            1.1162           -0.5000
AR.2           -0.5951           -0.9309j            1.1049           -0.3405
AR.3           -0.5951           +0.9309j            1.1049            0.3405
AR.4            1.0514           -0.0975j            1.0560           -0.0147
AR.5            1.0514           +0.0975j            1.0560            0.0147
MA.1            1.0000           -0.0000j            1.0000           -0.0000
MA.2           -0.5170           -0.8560j            1.0001           -0.3365
MA.3           -0.5170           +0.8560j            1.0001            0.3365
MA.4           -1.0001           -0.0000j            1.0001           -0.5000
-----------------------------------------------------------------------------

The best AIC scores come from ARIMA orders (5,1,4) and (5,1,3), at 81.366 and 84.318 respectively. Both are noticeably better than the initial model with order (1,1,5).

Looking at the coefficient p-values, the (5,1,4) model has p-values near 0 except for two greater than 0.8 (one autoregressive and one moving-average coefficient). The (5,1,3) model likewise has p-values near 0 except for two between 0.1 and 0.45. In both models, a couple of coefficients are not contributing significantly.

In [61]:
# Plot residual errors
residuals = pd.DataFrame(results_AR.resid)
fig, ax = plt.subplots(1,2,figsize=(15,5))
residuals.plot(title="Residuals", ax=ax[0])
residuals.plot(kind='kde', title='Density', ax=ax[1])
# Remove y-label and legends
for a in ax.flat:
    a.set(ylabel='')
    a.get_legend().remove()
plt.show()

The residuals look reasonable, with larger spikes at the turn of each year because annual average scores were interpolated linearly by month. The residual errors are roughly Gaussian, with a slight skew.

In [62]:
# Residual stats
residuals.describe()
Out[62]:
0
count 156.000000
mean -0.024889
std 0.299852
min -1.441196
25% -0.117959
50% -0.017617
75% 0.054264
max 1.575395
In [63]:
# Actual vs Fitted
with plt.rc_context():
    plt.rc("figure", figsize=(12,5))
    results_AR.plot_predict(dynamic=False)
plt.title('Actual vs. Fitted')
plt.show()

Evaluate Model Performance

In [64]:
# Split the timeseries into contiguous training and test datasets with an 80:20 ratio
split = int(len(interp_df) * 0.80)
train = interp_df[0:split]
len(train)
Out[64]:
125
In [65]:
test = interp_df[split:]
len(test)
Out[65]:
32
In [66]:
#from numpy.linalg import LinAlgError

# Build Model (note: if the fit fails here, 'fitted' will be undefined below)
try:
  model = ARIMA(train, order=(best_p, best_d, best_q))
  #model = ARIMA(train, order=(5,1,3))
  fitted = model.fit(disp=-1)
except (ValueError, LinAlgError):
  pass

# Forecast
fc, se, conf = fitted.forecast(len(test), alpha=0.05)  # 95% conf

# Make as pandas series
fc_series = pd.Series(fc, index=test.index)
lower_series = pd.Series(conf[:, 0], index=test.index)
upper_series = pd.Series(conf[:, 1], index=test.index)

# Plot
plt.figure(figsize=(12,5))
plt.plot(train, label='training')
plt.plot(test, label='actual')
plt.plot(fc_series, label='forecast', color='green')
plt.fill_between(lower_series.index, lower_series, upper_series, 
                 color='k', alpha=.15)
plt.ylim(0,100)
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=11)
plt.show()
C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\base\model.py:492: HessianInversionWarning:

Inverting hessian failed, no bse or cov_params available

In [67]:
def squared_error(true, predicted):
  """ Finds the squared error between two points: true value and predicted value """
  squared_err = (true - predicted)**2
  return squared_err
In [68]:
# Root squared error at the end of the forecast horizon
np.sqrt(squared_error(test.score[-1], fc_series[-1]))
Out[68]:
7.285318511601872
In [69]:
# Find the Mean Squared Error for all data in test and forecasted series
#from sklearn.metrics import mean_squared_error
error = mean_squared_error(test, fc_series)
print('Total Test MSE: %.3f' % error)
Total Test MSE: 7.844

The MSE between the test set and the forecast is 7.844 with ARIMA order (5,1,4), and 8.533 with order (5,1,3).
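
Because scores live on a 0-100 scale, the square root of the MSE is easier to interpret:

print(np.sqrt(7.844))  # RMSE ≈ 2.8 interest-score points for the (5,1,4) model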

In [70]:
# Evaluate model performance for one-year forecasts on a 4x3 grid of subplots.
# Calculate the root mean squared error (RMSE) of each 1-year forecast for the years 2007 through 2017,
# then print the mean RMSE.

# Create range of indices to split data on - splits yearly on January
splits = range(24,156,12)
all_rmse = []
fig = plt.figure(figsize=(15,10))

for i in range(len(splits)):
  train = interp_df[0:splits[i]]
  test = interp_df[splits[i]:]
  # Build Model  
  try:
    model = ARIMA(train, order=(best_p, best_d, best_q))  
    fitted = model.fit(disp=-1)  
  except (ValueError, LinAlgError):
    pass

  # Forecast
  fc, se, conf = fitted.forecast(len(test), alpha=0.05)  # 95% conf

  # Make as pandas series
  fc_series = pd.Series(fc, index=test.index)
  lower_series = pd.Series(conf[:, 0], index=test.index)
  upper_series = pd.Series(conf[:, 1], index=test.index)

  # Root squared error of the 12-months-ahead prediction (sq_err avoids shadowing the forecast standard errors 'se')
  sq_err = squared_error(test.score[12], fc_series[12])
  rmse = np.sqrt(sq_err)
  #print('The root mean squared error for 1 year forecast is', rmse)
  all_rmse.append(rmse)

  ax = fig.add_subplot(4,3,i+1)
  plt.plot(train, label='training')
  plt.plot(test, label='actual')
  plt.plot(fc_series, label='forecast', color='green')
  plt.fill_between(lower_series.index, lower_series, upper_series, 
                 color='k', alpha=.15)
  plt.ylim(0,100)
  # Vertical line to highlight 1 year prediction vs. actual score
  plt.axvline(test.index[12], color='red', linestyle='--')
  plt.legend(loc='upper left', fontsize=11)
  #plt.annotate('RMSE =' + str(rmse), (1,1))
  #ax.annotate('RMSE =' + str(rmse), xy=(0.5,0.5))
  
fig.suptitle('Actual vs. Forecasted Score', y=1.05, fontsize='large')
# Set common labels
fig.text(0.5, 0.0, 'Year', ha='center', va='center', fontsize='medium')
fig.text(0.0, 0.5, 'Average Interest Score', ha='center', va='center', rotation='vertical', fontsize='medium')
plt.tight_layout()
C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\base\model.py:492: HessianInversionWarning:

Inverting hessian failed, no bse or cov_params available

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\base\model.py:512: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:668: RuntimeWarning:

invalid value encountered in true_divide

In [71]:
np.mean(all_rmse)
Out[71]:
10.329052064632911
In [72]:
# Create a dataframe of years and corresponding rmse
predicted_years = np.arange(2007, 2018)
rmse_df = pd.DataFrame({'year': predicted_years, 'rmse':all_rmse})
rmse_df.set_index('year', inplace=True)
#print(rmse_df)
rmse_df.T
Out[72]:
year 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
rmse 3.166763 10.51636 14.686363 16.398675 14.916826 12.397931 13.000901 3.337355 4.795352 7.753561 12.649487

(5,1,4) model: the mean RMSE over all 1-year forecasts is 10.329 (computed above).

(5,1,3) model: the mean RMSE over all 1-year forecasts is 10.247.

Comparing 1-year forecasts, the ARIMA model with order (5,1,3) performs slightly better.

Forecasting

In [73]:
model = ARIMA(interp_df, order=(best_p, best_d, best_q))  # refit the best (5,1,4) model on the full series
results_AR = model.fit(disp=-1)
In [74]:
def plot_forecast(n_months=12):
  """ Plot ARIMA forecast, n_months after January 2017 """
  # Create new monthly indices for the forecast horizon
  end = interp_df.index[-1]
  new_inds = pd.date_range(end, periods=n_months + 1, freq='M')[1:]  # avoids deprecated Timestamp + int arithmetic
  # Forecast using fitted model
  fc, se, conf = results_AR.forecast(n_months, alpha=0.05)  # 95% conf
  # Make as pandas series
  fc_series = pd.Series(fc, index=new_inds)
  lower_series = pd.Series(conf[:, 0], index=new_inds)
  upper_series = pd.Series(conf[:, 1], index=new_inds)
  # Plot
  plt.figure(figsize=(12,5))
  plt.plot(interp_df, label='training')
  plt.plot(fc_series, label='forecast', color='green')
  plt.fill_between(lower_series.index, lower_series, upper_series, 
                 color='k', alpha=.15)
  plt.ylim(0,100)
  plt.title('Forecast')
  plt.xlabel('Year')
  plt.ylabel('Mean Interest Score')
  plt.legend(loc='upper left', fontsize=11)
  plt.show()
  print('The predicted score at end of forecast is =', fc_series[-1])
  return fc_series
In [75]:
fc = plot_forecast(12)
The predicted score at end of forecast is = 75.34580411939457
In [76]:
fc[-1:]
Out[76]:
2018-01-31    75.345804
dtype: float64
In [77]:
interp_df.tail()
Out[77]:
score
date
2016-09-30 67.166667
2016-10-31 68.145833
2016-11-30 69.125000
2016-12-31 70.104167
2017-01-31 71.083333
In [78]:
fc = plot_forecast(n_months=24)
The predicted score at end of forecast is = 74.96342224811609
In [79]:
fc = plot_forecast(n_months=36)
The predicted score at end of forecast is = 76.21549181014365

Automation for All States

In [80]:
grouped_df.head()
Out[80]:
year state score
0 2004 AK 32.000000
1 2004 AL 38.000000
2 2004 AR 31.500000
3 2004 AZ 30.500000
4 2004 CA 35.833333
In [81]:
len(set(grouped_df.state))
Out[81]:
47
In [82]:
def get_state_subset(state_abbrev): 
  """ Gets a subset for the state, input is the two letter state abbreviation as a string """
  state_df = grouped_df[grouped_df.state == state_abbrev]
  state_df = state_df.drop('state', axis=1)
  return state_df
In [83]:
def interpolate_df(df):
  """ Converts the 'year' column to datetime, renames it 'date' and sets it as the index (in place),
  then upsamples to monthly frequency with linear interpolation """
  # Use datetime formatting
  df['year'] = pd.to_datetime(df['year'], format="%Y")
  # Rename column 'year' to 'date'
  df.columns = ['date', 'score']
  df.set_index('date', inplace=True)
  interp_df = df.resample('M').mean().interpolate(method='linear')
  return interp_df
In [84]:
def get_best_model(interpolated_df):
  """ Fits model to interpolated dataframe, and returns the best parameters for p,d,q, along with the fitted ARIMA model. """
  ps = [0,1,2,3,4,5]
  ds=[0,1,2]
  qs = [0,1,2,3,4,5]
  best_aic = np.inf  # start above any achievable AIC
  best_p = 0
  best_d = 0
  best_q = 0
  for p in ps:
    for d in ds:
      for q in qs:
        try: 
          model = ARIMA(interpolated_df, order=(p,d,q)) # (p,d,q) -> p = lag value for autoregression, d = difference order, q = size of moving average window
          fitted_AR = model.fit(disp=-1)
          aic = fitted_AR.aic
          if aic < best_aic:
            best_p = p
            best_d = d
            best_q = q
            best_aic = aic
            #print(p,d, q, aic)
        except (ValueError, LinAlgError):  # some orders fail to converge or invert
          pass
  model = ARIMA(interpolated_df, order=(best_p,best_d,best_q))
  best_fitted = model.fit(disp=-1)
  return best_p, best_d, best_q, best_fitted
In [85]:
def one_yr_forecast(interp_df, fitted_model, n_months=12):
  """ Forecast, returns the forecasted value n_months after January 2017.
  The default function will return a 12 month / 1 year forecast value. """
  # Create new monthly indices for the forecast horizon
  end = interp_df.index[-1]
  new_inds = pd.date_range(end, periods=n_months + 1, freq='M')[1:]  # avoids deprecated Timestamp + int arithmetic
  # Forecast using fitted model
  fc, se, conf = fitted_model.forecast(n_months, alpha=0.05)  # 95% conf
  # Make as pandas series
  fc_series = pd.Series(fc, index=new_inds)
  #lower_series = pd.Series(conf[:, 0], index=new_inds)
  #upper_series = pd.Series(conf[:, 1], index=new_inds)
  #print('The predicted score at end of forecast is =', fc_series[-1])
  return fc_series[-1]
In [86]:
def plot_forecast_2(interp_df, fitted_model, n_months=12):
  """ Plot ARIMA forecast, n_months after January 2017 """
  # Create new monthly indices for the forecast horizon
  end = interp_df.index[-1]
  new_inds = pd.date_range(end, periods=n_months + 1, freq='M')[1:]  # avoids deprecated Timestamp + int arithmetic
  # Forecast using fitted model
  fc, se, conf = fitted_model.forecast(n_months, alpha=0.05)  # 95% conf
  # Make as pandas series
  fc_series = pd.Series(fc, index=new_inds)
  lower_series = pd.Series(conf[:, 0], index=new_inds)
  upper_series = pd.Series(conf[:, 1], index=new_inds)
  # Plot
  plt.figure(figsize=(12,5))
  plt.plot(interp_df, label='training')
  plt.plot(fc_series, label='forecast', color='green')
  plt.fill_between(lower_series.index, lower_series, upper_series, 
                 color='k', alpha=.15)
  plt.ylim(0,100)
  plt.title('Forecast')
  plt.xlabel('Year')
  plt.ylabel('Mean Interest Score')
  plt.legend(loc='upper left', fontsize=11)
  plt.show()
  print('The predicted score at end of forecast is =', fc_series[-1])
  return fc_series
In [87]:
state_df = get_state_subset('NC')
state_df.head()
Out[87]:
year score
26 2004 34.8
73 2005 33.0
120 2006 38.4
167 2007 40.6
214 2008 25.8
In [88]:
interp_df = interpolate_df(state_df)
interp_df.head()
Out[88]:
score
date
2004-01-31 34.80
2004-02-29 34.65
2004-03-31 34.50
2004-04-30 34.35
2004-05-31 34.20
In [89]:
p,d,q,fitted_AR = get_best_model(interp_df)
print(p,d,q)
C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:695: RuntimeWarning:

divide by zero encountered in true_divide

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:693: RuntimeWarning:

invalid value encountered in double_scalars

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:695: RuntimeWarning:

invalid value encountered in log

5 1 4
In [90]:
fc = plot_forecast_2(interp_df, fitted_AR, n_months=12)
The predicted score at end of forecast is = 76.97414850488222
In [91]:
one_yr_forecast(interp_df, fitted_AR, n_months=12)
Out[91]:
76.97414850488222
In [92]:
nc = get_state_subset('NC')
nc_interp = interpolate_df(nc)
p,d,q,fitted = get_best_model(nc_interp)
nc_fc = one_yr_forecast(nc_interp, fitted)
In [93]:
#import time

predicted_df = pd.DataFrame(columns=['year', 'state', 'score']) # Initialize dataframe to collect predictions
predicted_yrs = [2018] # Year(s) to predict

# A loop to find the predicted value for all states in predicted_yrs (2018)
start = time.time() # Time the loop
for state in set(grouped_df.state):
  for yr in predicted_yrs:
    state_df = get_state_subset(state) # Get the subset for each state
    interp_df = interpolate_df(state_df) # Interpolate the subset
    p,d,q,fitted_AR = get_best_model(interp_df) # Get the best fit model and parameters (p,d,q)
    fc = one_yr_forecast(interp_df, fitted_AR, n_months=12) # Get the 1 year forecasted value
    predicted_df = predicted_df.append({'year':yr, 'state':state, 'score':fc}, ignore_index=True)

print("Total time taken this loop: ", time.time() - start) # Print the time it takes to complete the loop
C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:695: RuntimeWarning:

divide by zero encountered in log

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:693: RuntimeWarning:

divide by zero encountered in double_scalars

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tools\numdiff.py:243: RuntimeWarning:

invalid value encountered in subtract

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:225: RuntimeWarning:

invalid value encountered in log

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:225: RuntimeWarning:

invalid value encountered in true_divide

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tools\numdiff.py:243: RuntimeWarning:

invalid value encountered in multiply

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:221: RuntimeWarning:

divide by zero encountered in true_divide

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:624: RuntimeWarning:

overflow encountered in tanh

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\regression\linear_model.py:1358: RuntimeWarning:

invalid value encountered in sqrt

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:651: RuntimeWarning:

divide by zero encountered in arctanh

C:\Users\chant\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:651: RuntimeWarning:

invalid value encountered in arctanh

Total time taken this loop:  2335.730831861496
In [94]:
predicted_df.head()
Out[94]:
year state score
0 2018 NE 72.129127
1 2018 CA 75.345804
2 2018 VA 83.318674
3 2018 MI 80.402315
4 2018 NC 76.974149
In [95]:
# Create a choropleth map
#import plotly.express as px  # Be sure to import express
#import plotly.graph_objects as go

#configure_plotly_browser_state()
fig = px.choropleth(predicted_df,  # Input Pandas DataFrame
                    locations="state",  # DataFrame column with locations
                    color="score",  # DataFrame column with color values
                    hover_name="state", # DataFrame column hover info
                    locationmode = 'USA-states', # Set location to US states
                    range_color=[20,100] # Set range for colorbar
                    ) 
fig.update_geos(showlakes=False) # Hide the lakes
fig.update_layout(
    title_text = 'Predicted Interest Scores for Vaccines (2018)', # Create a Title
    geo_scope='usa'  # Plot only the USA instead of globe
)
fig.show()  # Output the plot to the screen